This report explores a dataset containing information on the quality of 1,599 red wines. The purpose of this exploration is to get a better understanding of which chemical properties influence the quality of red wines.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The summary of the wine data confirms we have 1,599 wine samples that the analysis will be based on, with 12 attributes for each wine (11 chemical properties plus a quality score). Our target variable of interest is quality, as we want to determine whether other features in the data set are particularly correlated with a higher or lower quality of wine.
However, it looks like there is one variable in there (“X”) that is not actually descriptive of my data set and is just an index that increments for each row. Let’s remove that.
Now let’s reprint those names and see if that looks better.
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
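The removal step itself is not shown in this report. A minimal sketch of one way to do it in base R, using a tiny made-up data frame as a stand-in for the wine data (the values and the name `wine` are assumptions for illustration):

```r
# Toy stand-in for the wine data; the values are made up for illustration.
wine <- data.frame(
  X             = 1:4,
  fixed.acidity = c(7.4, 7.8, 11.2, 7.9),
  alcohol       = c(9.4, 9.8, 9.8, 10.2),
  quality       = c(5L, 5L, 6L, 6L)
)

# Drop the row-index column by name, then reprint the remaining names.
wine$X <- NULL
names(wine)
```

Dropping the column by name rather than by position keeps the step safe to re-run if the column order ever changes.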
It looks as though some scaling will be required, as the range of values for free.sulfur.dioxide and total.sulfur.dioxide seems to be on a drastically different scale from the remainder.
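As a rough illustration of what such scaling could look like, here is a sketch that standardises each column with base R's `scale()`. The numbers are hypothetical, and standardisation is only one possible choice, not necessarily the one needed for this analysis:

```r
# Hypothetical values on very different scales (not taken from the data set).
wine <- data.frame(
  chlorides            = c(0.05, 0.08, 0.09, 0.12, 0.61),
  total.sulfur.dioxide = c(6, 22, 38, 62, 289)
)

# scale() centres each column on its mean and divides by its standard deviation.
wine_scaled <- as.data.frame(scale(wine))

round(colMeans(wine_scaled), 10)     # each column is now centred on 0
round(sapply(wine_scaled, sd), 10)   # and has unit standard deviation
```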
Let’s look at the distribution of wine quality over the data set.
The histogram demonstrates that the quality scores are roughly normally distributed, with the majority of wines scoring a 5 or a 6 on the quality scale. This is consistent with the summary above, which shows a mean of 5.636 and a median of 6. Interestingly, no wines scored a 9 or 10, and nothing scored below a 3. I wonder if this is due to some inherent bias from the individual assigning quality scores, who may not have wanted to grossly over or under mark any wines that they tested.
The bias in this data toward the middle of the quality curve is likely to mean that we can find predictors of mid quality wine, but not necessarily high quality (as the data will be too sparse to effectively identify high quality wines).
So, on to looking for the magic ingredients that make wine “good” quality. Let’s start by plotting every feature independently on a histogram.
Fixed acidity:
Volatile acidity:
## Warning: Removed 4 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
Citric acid:
## Warning: Removed 1 rows containing non-finite values (stat_bin).
Residual sugar:
## Warning: Removed 29 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
Chlorides:
## Warning: Removed 43 rows containing non-finite values (stat_bin).
Free sulfur dioxide:
## Warning: Removed 16 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
Total sulfur dioxide:
## Warning: Removed 9 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
Density:
pH level:
## Warning: Removed 8 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
Sulphates:
## Warning: Removed 17 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
Alcohol:
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
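Since the summary earlier shows no missing values, the "Removed N rows" warnings above most likely come from axis limits that cut off the extreme values in each histogram, which is when ggplot2 reports dropped rows. The limits used for these plots are not shown, so the sketch below uses hypothetical limits and made-up values just to show how the dropped count can be checked:

```r
# Hypothetical feature values with a long right tail (not the real data).
set.seed(1)
residual.sugar <- c(rnorm(95, mean = 2.5, sd = 0.4), 8, 9, 11, 13, 15.5)

# Hypothetical x-axis limits; the ones used for the plots above are not shown.
limits <- c(0, 6)

# Values outside the limits are the rows a limited axis would drop.
dropped <- sum(residual.sugar < limits[1] | residual.sugar > limits[2])
dropped
```

If the count matches the number in the warning, the warning is just reporting the trimmed tail rather than a data quality problem.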
Now that we have a better understanding of the distribution of each individual element, let’s compare features to quality to see if trends start to emerge on key features of high quality wine.
Maybe more alcoholic wines are considered of higher quality? I will look at these two features plotted against each other to see if there is any association between the two.
This plot indicates there is a weak to moderate positive relationship between alcohol content and quality: as alcohol content increases, wines tend to score higher (with very low quality wines all having relatively low alcohol content). However, this does not visually appear to be highly correlated, as there is a wide spread of alcohol content at each quality score of 5 or greater.
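To put a number on a visual impression like this one, a correlation coefficient and a per-score summary can help. The sketch below uses simulated data (an assumption, not the real measurements), so the printed values say nothing about the actual wines:

```r
# Simulated stand-in data: quality rises weakly with alcohol, plus noise.
set.seed(42)
alcohol <- runif(200, min = 8.4, max = 14.9)
quality <- pmin(8, pmax(3, round(5.6 + 0.35 * (alcohol - 10.4) + rnorm(200, sd = 0.8))))

# Pearson's r summarises the linear association on a -1 to 1 scale.
r <- cor(alcohol, quality)

# Median alcohol content within each quality score.
by_quality <- tapply(alcohol, quality, median)

round(r, 2)
round(by_quality, 2)
```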
Based on the descriptive information that helps readers understand what these different features relate to, there are some other potential characteristics of interest.
Let’s explore the relationship between quality and volatile acidity. Based on the descriptive information we should expect to see a ceiling on quality once volatile acidity reaches over a given amount.
There is definitely a trend here, with very few wines in the highest quality buckets having more than 0.8 g/dm^3 of acetic acid, but interestingly not many wines in general seem to be over this level (even the lowest quality wines).
Next I want to look at residual sugar levels, to see if sweet red wine is considered higher quality than dry red wine.
## Warning: Removed 86 rows containing missing values (geom_point).
It looks as though higher quality wines contain less residual sugar, indicating that sweet red wines are not considered to be of high quality. What about chlorides / salt? It would seem that salty wine is not desirable, so I think that we’ll see low salt in wine across the board regardless of quality level.
## Warning: Removed 69 rows containing missing values (geom_point).
As expected, the highest quality wines all have very low levels of chlorides. The interesting part is that mid quality wines seem to have the highest salt levels compared to even the lower quality wines (see the upper bound on quality 5). I want to look at this as a boxplot to see if that shows a better story of the distribution of this particular feature.
## Warning: Removed 69 rows containing non-finite values (stat_boxplot).
The box plot clearly shows the mid tier wines having significant outliers in terms of salt content (mostly on the high side). This is very interesting, and controlled salt content seems like it could be a candidate for an ingredient that moves a mid tier wine to a top tier wine.
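The outlier points on a box plot can also be counted per quality tier. This sketch relies on base R's `boxplot.stats()`, which applies the usual 1.5 x IQR rule; the chloride readings are made up for illustration:

```r
# Made-up chloride readings for two quality tiers (illustration only).
chlorides <- c(0.07, 0.08, 0.08, 0.08, 0.09, 0.09, 0.09, 0.10, 0.35, 0.61,  # quality 5
               0.06, 0.07, 0.07, 0.08, 0.08, 0.09)                          # quality 8
quality   <- c(rep(5, 10), rep(8, 6))

# boxplot.stats() flags points more than 1.5 * IQR beyond the hinges,
# the same rule a box plot uses to draw individual outlier points.
outliers <- tapply(chlorides, quality, function(v) length(boxplot.stats(v)$out))
outliers
```

In this toy example only the quality 5 group has flagged points, which mirrors the pattern described above without claiming anything about the real counts.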
Continuing the investigation of individual attributes related to quality, next I want to look at the different types of sulfur and their relationship to quality.
First free sulfur dioxide:
## Warning: Removed 45 rows containing missing values (geom_point).
And total sulfur dioxide:
## Warning: Removed 9 rows containing missing values (geom_point).
Again, similar trends as would be expected based on the descriptive information provided about these attributes. Higher quality wines generally seem to have a lower quantity of sulfur dioxide (either free or total), however there are wines at lower quality levels that have the same amount of sulfur - so that is indicative that this is not a silver bullet.
What about the density or pH level of the wine?
Density:
pH level:
Neither density nor pH seems to have a strong relationship with quality; both attributes are fairly evenly distributed for high quality wines, and no strong trend (positive or inverse) vs. wine quality can be drawn from these plots.
Finally I want to look at the level of sulphates in the wine:
## Warning: Removed 17 rows containing missing values (geom_point).
Again this plot seems to indicate there is a small relationship between quality and amount of sulphates (there should be a little, but not a lot, somewhere in the 0.75 range).
Based on all these individual features vs. quality plots - there is no standout “silver bullet” that is required to make a wine high quality. There are things that should be avoided (such as high salt content) but no one thing that will give a high quality wine - which is generally what I had expected before starting this analysis given the complexity of the subject area.
Therefore we need to look at multiple elements in combination to see if there are groups of attributes that do have a direct impact on the quality of wine.
## Warning: Removed 22 rows containing missing values (geom_point).
This plot is now starting to get interesting. Showing the relationship between the amount of chlorides (which we already knew was lower for high quality wines), and alcohol content. We can see an increasing relationship between the high alcohol content wines (shown by the lighter dots), combined with low chlorides.
I broke out the jitter plot to try and add more insight into what was happening in the over crowded 5/6 quality zone.
Let’s see what other relationships exist.
Let’s use facets instead of a visual determinant to help visually separate the different quality categories.
## Warning: Removed 21 rows containing missing values (geom_point).
The graphs above show that there is a tighter clustering of volatile acidity (confirming what we had looked at previously as an independent variable), however fixed acidity does not seem to have such a direct correlation to quality of wine.
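A faceted scatter plot of this kind could be built roughly as follows. This is a sketch on simulated data that assumes ggplot2 is installed; the aesthetics and settings of the plots in this report are not shown, so everything here is illustrative:

```r
# Simulated stand-in data (not the real measurements).
set.seed(7)
wine <- data.frame(
  fixed.acidity    = runif(120, 4.6, 15.9),
  volatile.acidity = runif(120, 0.12, 1.58),
  quality          = sample(3:8, 120, replace = TRUE)
)

# ggplot2 is assumed to be installed; the plot is skipped otherwise.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(wine, ggplot2::aes(x = fixed.acidity, y = volatile.acidity)) +
    ggplot2::geom_point(alpha = 0.4) +
    ggplot2::facet_wrap(~ quality)   # one panel per quality score
  # Call print(p) to draw the plot.
}
```

Facets give each quality score its own panel with shared axes, so the tightness of each cluster can be compared directly.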
## Warning: Removed 108 rows containing missing values (geom_point).
Similar story here, residual sugar and chlorides must be low for a high quality wine. There is not a huge difference in clustering in a quality 8 wine even when compared to the quality 4 wines though.
These features do not seem to be important toward quality detection. The pattern and the relationship of sulfur dioxide levels is pretty consistent between low and high quality wines (high quality wines are more clustered, but there are also fewer wines at this level).
What features is the model selecting in predicting the quality of a wine? This will help ensure that I’m looking at the right elements in my final plots.
Looking at a graph of feature importance, by far the most important feature appears to be alcohol content, followed by sulphates and volatile.acidity. However, the features in the dataset are much more likely to be predictors for whether a wine is medium quality - rather than low or high quality (which is interesting).
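The model code is not included in this report, so the following is only a sketch of how feature importance could be pulled from a random forest classifier. It assumes the randomForest package is installed, uses simulated data, and picks `ntree = 200` arbitrarily:

```r
# Simulated stand-in data in which alcohol alone drives the quality label.
set.seed(123)
n <- 300
wine <- data.frame(
  alcohol   = runif(n, 8.4, 14.9),
  sulphates = runif(n, 0.33, 2),
  pH        = runif(n, 2.74, 4.01)
)
wine$quality <- factor(ifelse(wine$alcohol > 11, "6", "5"))

# The randomForest package is assumed to be installed; skipped otherwise.
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit <- randomForest::randomForest(quality ~ ., data = wine,
                                    ntree = 200, importance = TRUE)
  imp <- randomForest::importance(fit, type = 1)   # mean decrease in accuracy
  print(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE])
}
```

Treating quality as a factor makes this a classification model; treating it as a number would give a regression forest instead, and the importance ranking could differ.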
Based on a fresh perspective from this model, let’s replot alcohol content and sulphates against each other.
The feature importance in the graph would seem to be highlighting the cluster of low alcohol wines that are in the “5” quality bucket; likely the thing that the random forest model picked up on (that lower alcohol wines are correlated to a 5 or 6 quality wine based on this data set).
Finally, to make sure that we haven’t missed anything, here is a quick plot of the Pearson correlation between all the attributes in our model before we move into the final analysis.
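The numbers behind a correlation plot like this come from a pairwise correlation matrix. A minimal base R sketch on simulated data (the columns and coefficients are assumptions for illustration only):

```r
# Simulated stand-in data with a built-in alcohol/density/quality relationship.
set.seed(2)
n <- 100
alcohol <- rnorm(n, mean = 10.4, sd = 1)
wine <- data.frame(
  alcohol = alcohol,
  density = 0.9967 - 0.001 * (alcohol - 10.4) + rnorm(n, sd = 0.001),
  quality = round(5.6 + 0.4 * (alcohol - 10.4) + rnorm(n, sd = 0.7))
)

# Pairwise Pearson correlations between every pair of columns.
corr <- cor(wine, method = "pearson")
round(corr, 2)

# Correlation of each feature with quality, strongest positive first.
with_quality <- corr[setdiff(rownames(corr), "quality"), "quality"]
sort(with_quality, decreasing = TRUE)
```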
I think based on the analysis done so far in this project, I can now start to finalise my plots and draw some conclusions.
Let’s start by re-iterating the purpose of this data exploration:
Which chemical properties influence the quality of red wines?
The best way to start answering a question such as the one above is to look at the relative correlations of all the features in the dataset, plotted against each other. Going beyond the earlier correlation plot, the chart should visually highlight the hotspots of correlation and show whether the correlations are statistically significant or not.
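One way to check whether a correlation is statistically significant is base R's `cor.test()`. The sketch below runs it for each feature against quality on simulated data; the 0.05 threshold is a conventional choice and an assumption here, not something stated in this report:

```r
# Simulated stand-in data: alcohol is related to quality, pH is not.
set.seed(3)
n <- 150
wine <- data.frame(alcohol = rnorm(n, 10.4, 1), pH = rnorm(n, 3.3, 0.15))
wine$quality <- round(5.6 + 0.4 * (wine$alcohol - 10.4) + rnorm(n, sd = 0.7))

# Test each feature's Pearson correlation with quality.
features <- setdiff(names(wine), "quality")
result <- t(sapply(features, function(f) {
  ct <- cor.test(wine[[f]], wine$quality, method = "pearson")
  c(r = unname(ct$estimate), p.value = ct$p.value)
}))

round(result, 4)
result[, "p.value"] < 0.05   # TRUE where the correlation is significant at 5%
```

With many features tested at once, some correction for multiple comparisons would be worth considering before reading too much into any single p-value.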
The chart above highlights that there are a handful of features positively correlated with wine quality, namely:
In addition, there are some features that are negatively correlated with the quality of the wine:
From my earlier analysis, there was an interesting relationship present between the Alcohol content, the Sulphates in the wine, and the quality score attained. Let’s visualise that relationship:
## Warning: Removed 14 rows containing missing values (geom_point).
The plot above shows a relationship between all the provided attributes - we can see as wine quality is increasing that there is an increasing amount of citric acid (however not reaching the top of the citric acid range, settling between 0.50 and 0.75 g/dm^3), a similar amount of potassium sulphate and at least 12% alcohol by volume.
Similarly, we can plot the features that were negatively correlated with quality to see how that differs on a plot:
## Warning: Removed 9 rows containing missing values (geom_point).
The plot above demonstrates the negatively correlated features, with higher sulfur dioxide content, higher volatile acidity or higher density values being associated with lower quality wines. Note that as with the results noted previously from the Random Forest model employed during my initial analysis, it is possible that the skew in the sample data is affecting these results.
That is, it is possible that due to the lower quantity of “high quality” wines, that we are missing out on correlating these features with high quality wines purely due to the sparsity of data.
So in conclusion, if a wine maker would like to ensure their wines are of high quality - there is unfortunately no silver bullet. The US wine market contributes approximately $220bn each year to the American economy - so it stands to reason this is a complicated problem.
The analysis above highlights if you want to improve quality there are 6-8 features of your wine that you can double down on which will maximise your chances of producing a good wine, so get mixing!
The red wine data set contained 1599 data samples covering 13 variables from around 2009. I started out looking at what variables were available in the dataset, and looking at the distribution of my data. Then, based on the descriptive text provided about the attributes, I started exploring the data to see if there were any immediately obvious relationships between the features and quality of the wine. After bottoming out individual features, I then started exploring pairs of features and did start to notice more trends emerging.
I then took a turn in my analysis and decided to use a more data driven approach, opting to build a simple ML model (using a random forest predictor) to try and classify the data based on the available features and highlight which features were the most predictive of quality. The model had varying levels of success, mainly due to the highly sparse data, and was more predictive for 5/6 quality wines rather than all wines.
Using a data driven approach seemed to work well, and helped avoid manually exploring all the potential pairs of features in the data set. However, to improve on this, the data set would need to be larger (ideally updated from 2009 to the current date), so we could have more data and potentially even examine the trend in quality vs. chemical composition over time, as tastes may have changed in the last 8-9 years.
More data would help here as the sparsity of the data set and bias toward the middle quality tiers causes the data to be less predictive (e.g. the model will say most things are a 5 or 6, because most things were a 5 or 6 in the provided data).
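The effect of this imbalance can be made concrete with a majority-class baseline: the accuracy a model would get by always predicting the most common score. The counts below are hypothetical and only mimic the shape of the distribution described above:

```r
# Hypothetical quality scores, heavily concentrated on 5 and 6.
quality <- c(rep(3, 1), rep(4, 3), rep(5, 43), rep(6, 40), rep(7, 12), rep(8, 1))

counts   <- table(quality)
majority <- names(counts)[which.max(counts)]   # most common score
baseline <- max(counts) / length(quality)      # accuracy of always guessing it

majority
baseline
```

Any model trained on data like this needs to beat that baseline before its predictions for the rarer scores can be trusted.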